 collection strategy


A Feedback-Control Framework for Efficient Dataset Collection from In-Vehicle Data Streams

arXiv.org Artificial Intelligence

Modern AI systems are increasingly constrained not by model capacity but by the quality and diversity of their data. Despite growing emphasis on data-centric AI, most datasets are still gathered in an open-loop manner, which accumulates redundant samples without feedback from the current coverage. This results in inefficient storage, costly labeling, and limited generalization. To address this, this paper introduces Feedback Control Data Collection (FCDC), a paradigm that formulates data collection as a closed-loop control problem. FCDC continuously approximates the state of the collected data distribution using an online probabilistic model and adaptively regulates sample retention based on feedback signals such as likelihood and Mahalanobis distance. Through this feedback mechanism, the system dynamically balances exploration and exploitation, maintains dataset diversity, and prevents redundancy from accumulating over time. In addition to demonstrating the controllability of FCDC on a synthetic dataset that converges toward a uniform distribution under a Gaussian input assumption, experiments on real data streams show that FCDC produces more balanced datasets by 25.9% while reducing data storage by 39.8%. These results demonstrate that data collection itself can be actively controlled, transforming collection from a passive pipeline stage into a self-regulating, feedback-driven process at the core of data-centric AI.
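The idea of regulating retention with a Mahalanobis-distance feedback signal can be illustrated with a minimal sketch. This is not the FCDC algorithm from the paper: the class name `MahalanobisGate`, the fixed `threshold`, the `warmup` count, and the `ridge` regularizer are all assumptions made for illustration, and the online model here is a single running Gaussian over the samples kept so far.

```python
import numpy as np


class MahalanobisGate:
    """Hypothetical retention gate (illustrative only, not the paper's method).

    Keeps a sample when it lies far, in Mahalanobis distance, from a running
    Gaussian fitted to the samples retained so far.
    """

    def __init__(self, dim, threshold=2.0, warmup=10, ridge=1e-6):
        self.dim = dim
        self.threshold = threshold  # assumed fixed cut-off on the distance
        self.warmup = warmup        # number of initial samples kept unconditionally
        self.ridge = ridge          # small diagonal term keeping the covariance invertible
        self.n = 0
        self.mean = np.zeros(dim)
        self.m2 = np.zeros((dim, dim))  # running sum of deviation outer products

    def _update(self, x):
        # Welford-style online update of the mean and scatter matrix.
        self.n += 1
        delta = x - self.mean
        self.mean = self.mean + delta / self.n
        self.m2 = self.m2 + np.outer(delta, x - self.mean)

    def distance(self, x):
        # Mahalanobis distance of x to the current estimate of the kept data.
        cov = self.m2 / max(self.n - 1, 1) + self.ridge * np.eye(self.dim)
        diff = np.asarray(x, dtype=float) - self.mean
        return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

    def offer(self, x):
        # Feedback step: the decision depends on the state of the kept data,
        # and a kept sample in turn updates that state.
        x = np.asarray(x, dtype=float)
        keep = self.n < self.warmup or self.distance(x) > self.threshold
        if keep:
            self._update(x)
        return keep
```

In this sketch a sample close to the mean of the retained data is rejected as redundant, while a distant one is kept and widens the estimated covariance, which in turn raises the bar for later samples.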


Donate or Create? Comparing Data Collection Strategies for Emotion-labeled Multimodal Social Media Posts

arXiv.org Artificial Intelligence

Accurate modeling of subjective phenomena such as emotion expression requires data annotated with authors' intentions. Commonly such data is collected by asking study participants to donate and label genuine content produced in the real world, or create content fitting particular labels during the study. Asking participants to create content is often simpler to implement and presents fewer risks to participant privacy than data donation. However, it is unclear if and how study-created content may differ from genuine content, and how differences may impact models. We collect study-created and genuine multimodal social media posts labeled for emotion and compare them on several dimensions, including model performance. We find that compared to genuine posts, study-created posts are longer, rely more on their text and less on their images for emotion expression, and focus more on emotion-prototypical events. The samples of participants willing to donate versus create posts are demographically different. Study-created data is valuable to train models that generalize well to genuine data, but realistic effectiveness estimates require genuine data.


On the Benefits of Active Data Collection in Operator Learning

arXiv.org Machine Learning

We investigate active data collection strategies for operator learning when the target operator is linear and the input functions are drawn from a mean-zero stochastic process with continuous covariance kernels. With an active data collection strategy, we establish an error convergence rate in terms of the decay rate of the eigenvalues of the covariance kernel. Thus, with sufficiently rapid eigenvalue decay of the covariance kernels, arbitrarily fast error convergence rates can be achieved. This contrasts with the passive (i.i.d.) data collection strategies, where the convergence rate is never faster than $\sim n^{-1}$. In fact, for our setting, we establish a non-vanishing lower bound for any passive data collection strategy, regardless of the eigenvalue decay rate of the covariance kernel. Overall, our results show the benefit of active over passive data collection strategies in operator learning.


Federated Learning with Integrated Sensing, Communication, and Computation: Frameworks and Performance Analysis

arXiv.org Artificial Intelligence

With the emergence of integrated sensing, communication, and computation (ISCC) in the upcoming 6G era, federated learning with ISCC (FL-ISCC), integrating sample collection, local training, and parameter exchange and aggregation, has garnered increasing interest for enhancing training efficiency. Currently, FL-ISCC primarily includes two algorithms: FedAVG-ISCC and FedSGD-ISCC. However, the theoretical understanding of the performance and advantages of these algorithms remains limited. To address this gap, we investigate a general FL-ISCC framework, implementing both FedAVG-ISCC and FedSGD-ISCC. We experimentally demonstrate the substantial potential of the ISCC framework in reducing latency and energy consumption in FL. Furthermore, we provide a theoretical analysis and comparison. The results reveal that: 1) Both sample collection and communication errors negatively impact algorithm performance, highlighting the need for careful design to optimize FL-ISCC applications. 2) FedAVG-ISCC performs better than FedSGD-ISCC under IID data due to its advantage with multiple local updates. 3) FedSGD-ISCC is more robust than FedAVG-ISCC under non-IID data, where the multiple local updates in FedAVG-ISCC worsen performance as non-IID data increases. FedSGD-ISCC maintains performance levels similar to IID conditions. 4) FedSGD-ISCC is more resilient to communication errors than FedAVG-ISCC, which suffers from significant performance degradation as communication errors increase. Extensive simulations confirm the effectiveness of the FL-ISCC framework and validate our theoretical analysis.
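The difference between the two aggregation schemes named above comes down to what each client sends per round. The following is a minimal sketch of plain FedSGD and FedAvg on a least-squares problem; it leaves out the sensing and communication-error modelling that the ISCC variants add, and the function names, the equal client weighting, and the loss are assumptions chosen for illustration.

```python
import numpy as np


def grad(w, X, y):
    """Gradient of the loss 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)


def fedsgd_round(w, clients, lr):
    """FedSGD: each client sends one gradient; the server averages and takes one step."""
    g = np.mean([grad(w, X, y) for X, y in clients], axis=0)
    return w - lr * g


def fedavg_round(w, clients, lr, local_steps):
    """FedAvg: each client runs several local steps; the server averages the models."""
    local_models = []
    for X, y in clients:
        w_k = w.copy()
        for _ in range(local_steps):
            w_k = w_k - lr * grad(w_k, X, y)
        local_models.append(w_k)
    # Equal weighting assumes clients hold the same number of samples.
    return np.mean(local_models, axis=0)
```

With `local_steps=1` the two rounds coincide. Additional local steps let each client drift toward its own optimum, which is the mechanism usually cited for FedAvg's sensitivity to non-IID data.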


A Framework for Undergraduate Data Collection Strategies for Student Support Recommendation Systems in Higher Education

arXiv.org Artificial Intelligence

Understanding which student support strategies mitigate dropout and improve student retention is an important part of modern higher educational research. One of the largest challenges institutions of higher learning currently face is the scalability of student support. Part of this is due to the shortage of staff addressing the needs of students, and the subsequent referral pathways associated to provide timeous student support strategies. This is further complicated by the difficulty of these referrals, especially as students are often faced with a combination of administrative, academic, social, and socio-economic challenges. A possible solution to this problem can be a combination of student outcome predictions and applying algorithmic recommender systems within the context of higher education. While much effort and detail has gone into the expansion of explaining algorithmic decision making in this context, there is still a need to develop data collection strategies. Therefore, the purpose of this paper is to outline a data collection framework specific to recommender systems within this context in order to reduce collection biases, understand student characteristics, and find an ideal way to infer optimal influences on the student journey. If confirmation biases, challenges in data sparsity and the type of information to collect from students are not addressed, it will have detrimental effects on attempts to assess and evaluate the effects of these systems within higher education.


AI & other emerging tech to revolutionise debt recovery for banks, NBFCs - Express Computer

#artificialintelligence

Established in 2017, Credgenics is a SaaS-based end-to-end debt recovery platform. Presently, over 50 lenders use the platform, including seven banks with notable names like ICICI, Axis, and HDFC, and more than 40 NBFCs, such as LoanTap, Drip Capital, and Udaan, among others. In the last three years Credgenics has managed month-on-month growth of 80–100 per cent. The startup has raised a Series A funding round, with Tanglin Venture Partners and Westbridge Capitals being the main contributors, and with participation from the existing investor Accel Partners; the valuation has now reached US$ 100 million. Credgenics has also on-boarded more than 2200 lawyers and collection partners, apart from building a solid team of more than 150 enthusiasts and experts.


Manpower Puts Sidetrade's Artificial Intelligence At The Core

#artificialintelligence

With an annual income of €4 bn, Manpower France collects 1.3 million receivables from 80,000 companies. To handle this volume, and increasingly complex payment procedures, Manpower's Finance department started using Sidetrade technology in 2013. Sidetrade accelerates automation of the order-to-cash process, and models collection strategies for different segments of clientele. As a result, Manpower France improved their efficiency with a significant reduction in days sales outstanding. Despite this excellent performance, considering the complexity of the purchasing process and the exponential growth in data, Sidetrade decided to enrich their platform with Artificial Intelligence technology.


Manpower puts Sidetrade's Artificial Intelligence at the core of their organization

#artificialintelligence

Manpower France anticipates and drives change in the world of work by breaking new ground with Aimie, Sidetrade's cutting-edge Artificial Intelligence system. To optimize Credit Management, Manpower equipped their Finance team with Sidetrade's groundbreaking technology, now available to all of Sidetrade's customers. With an annual income of €4 bn, Manpower France collects 1.3 million receivables from 80,000 companies. To handle this volume, and increasingly complex payment procedures, Manpower's Finance department started using Sidetrade technology in 2013. Sidetrade accelerates automation of the order-to-cash process, and models collection strategies for different segments of clientele.


Learning Beam Search Policies via Imitation Learning

arXiv.org Artificial Intelligence

Beam search is widely used for approximate decoding in structured prediction problems. Models often use a beam at test time but ignore its existence at train time, and therefore do not explicitly learn how to use the beam. We develop a unifying meta-algorithm for learning beam search policies using imitation learning. In our setting, the beam is part of the model, and not just an artifact of approximate decoding. Our meta-algorithm captures existing learning algorithms and suggests new ones. It also lets us show novel no-regret guarantees for learning beam search policies.
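For readers unfamiliar with the decoding procedure itself, here is a minimal sketch of standard beam search. It illustrates only test-time decoding, not the paper's imitation-learning meta-algorithm; the function names, the toy scoring table, and the `"<s>"` start symbol are assumptions made for the example.

```python
import math


def beam_search(score_next, start, beam_width, max_steps, is_final):
    """Keep the `beam_width` highest-scoring partial sequences at every step.

    score_next(seq) returns (token, log_probability) pairs for extending seq.
    """
    beam = [(0.0, (start,))]
    for _ in range(max_steps):
        candidates = []
        for score, seq in beam:
            if is_final(seq):
                candidates.append((score, seq))  # carry finished sequences forward
                continue
            for token, logp in score_next(seq):
                candidates.append((score + logp, seq + (token,)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_width]
        if all(is_final(seq) for _, seq in beam):
            break
    return beam


# Toy model (hypothetical): the best full sequence starts with the token
# that looks worse after the first step.
TABLE = {
    ("<s>",): [("a", math.log(0.6)), ("b", math.log(0.4))],
    ("<s>", "a"): [("x", math.log(0.55)), ("y", math.log(0.45))],
    ("<s>", "b"): [("x", math.log(0.9)), ("y", math.log(0.1))],
}


def toy_score_next(seq):
    return TABLE[seq]


def toy_is_final(seq):
    return len(seq) == 3


greedy = beam_search(toy_score_next, "<s>", 1, 5, toy_is_final)
wide = beam_search(toy_score_next, "<s>", 2, 5, toy_is_final)
```

In this toy table a beam of width 1 commits to `a` and ends with probability 0.33, while a beam of width 2 keeps `b` alive and recovers the sequence with probability 0.36.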


How Utilities Can Use Machine Learning for Bad Debt Control

#artificialintelligence

Bad debt control can be considered a use case under the umbrella of revenue protection. It can produce a hard dollar value, which makes it easier to show the benefits of the project. Different types of revenue protection projects, such as power theft, unaccounted energy, fraud, and bad debt, require different methods to identify the issues. Most approaches rely on rule-based detection; others use machine learning models. For example, a machine learning model can be used to generate credit risk scores for customers.